Reward modeling for mitigating toxicity in transformer-based language models
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models that are pretrained on large unlabeled web corpora have been shown to generate toxic content and exhibit social bias, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate model toxicity; however, these methods struggle to detoxify models when the conditioning prompts contain specific identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models with a new reward model that detects toxic content and mitigates unintended bias towards social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach is less prone to unintended bias toward social identities in generated content.
Similar resources
Power Transformer Modeling for Inrush Current Calculation
The paper documents a new transformer model in ATPDraw called XFMR. This model handles 3-phase transformers with two or three windings. Autotransformers and all Wye and Delta couplings are supported. The model includes an inverse inductance matrix for the leakage description, optional frequency dependent winding resistance, capacitive coupling, and a topologically correct core model (3- and 5-leg...
Modeling Quantum Entanglements in Quantum Language Models
Recently, a Quantum Language Model (QLM) was proposed to model term dependencies upon Quantum Theory (QT) framework and successively applied in Information Retrieval (IR). Nevertheless, QLM’s dependency is based on co-occurrences of terms and has not yet taken into account the Quantum Entanglement (QE), which is a key quantum concept and has a significant cognitive implication. In QT, an entang...
Transformer Modeling Based on Standard Frequency Response Measurements
High frequency models of large power transformers are required for analysis of transient interaction phenomena between transformers and the power system. Fast transient overvoltages may lead to transformer dielectric failures. Deeper understanding of the mechanisms may help to take actions against possible damages. This paper describes the simulation principle of transient interaction in matter...
A reward-based approach for preference modeling: A case study
Most reasoning for decision making in daily life is based on preferences. As with other kinds of reasoning processes, many formalisms try to capture preferences; however, none of them is able to capture all the subtleties of human reasoning. In this paper we analyze how to formalize the preferences expressed by humans and how to reason with them to produce rankings. Parti...
Journal
Journal title: Applied Intelligence
Year: 2022
ISSN: ['0924-669X', '1573-7497']
DOI: https://doi.org/10.1007/s10489-022-03944-z